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Abstract 

The current article introduces a supervised learning algorithm for multilayer spiking 
neural networks. The algorithm presented here overcomes some limitations of existing 
learning algorithms as it can be applied to neurons firing multiple spikes and it can 
in principle be applied to any linear! sable neuron model. The algorithm is applied 
successfully to various benchmarks, such as the XOR problem and the Iris data set, 
as well as complex classifications problems. The simulations also show the flexibility 
of this supervised learning algorithm which permits different encodings of the spike 
timing patterns, including precise spike trains encoding. 

1 Introduction 

Traditional rate coded artificial neural networks represent an analog variable through 
the firing rate of the biological neuron. That is, the output of a computational unit is 
a representation of the firing rate of the biological neuron. In order to increase the 
computational power of the network the neurons are structured in successive layers 



of computational units. Such systems are trained to recognise the input patterns by 
searching for a set of suitable connections weights. Learning rules based on gradient 
decent, such as backpropagation ( [Rumelhart et al.[ |1986| ), have led sigmoidal neural 
networks (networks that use the sigmoid as the activation function) to be one of the 
most powerful and flexible computational models. 

However, experimental evidence suggests that neural systems use the exact time of 



single action potentials to encode information (Thorpe & Imbert 1989, Johansson & 



Birznieks 2004[ ). In Thorpe & Imbert ( 1989) it is argued that because of the speed of 



processing visual information and the anatomical structure of the visual system, pro- 



cessing has to be done on the basis of single spikes. In Johansson & Birznieks (2004) 
it is shown that the relative timing of the first spike contains important information 
about tactile stimuli. Further evidence suggests that the precise temporal firing pattern 



of groups of neurons conveys relevant sensory information ( |Wehr & Laurent[ |1996 
Neuenschwander & Singer[fl996[|deCharms & Merzenich[|1996| ). 



These findings have led to a new way of simulating neural networks based on tem- 
poral encoding with single spikes (Maass |1997a[ ). Investigations of the computational 
power of spiking neurons have illustrated that realistic mathematical models of neu- 
rons can arbitrarily approximate any continuous function, and furthermore, it has been 
demonstrated that networks of spiking neurons are computationally more powerful than 
sigmoidal neurons (Maass 1997b| ). Because of the nature of spiking neuron communi- 
cation, these are also suited for VSLI implementation with significant speed advantages 



dElias & Northmorej 120021 ). 

In this paper, we present a new learning algorithm for feed-forward spiking neural 



networks with multiple layers. The learning rule extends the ReSuMe algorithm (Ponu- 



lak & Kasihski[ 2010[ ) to multiple layers using backpropagation of the network error. 



The weights are updated according to STDP and anti-STDP processes and unlike Spike- 
Prop ( Bohte et al.[ 2002[ ) can be applied to neurons firing multiple spikes in all layers. 
The multilayer ReSuMe is analogue of the backpropagation learning algorithm for rate 
neurons, while making use of spiking neurons. To the best of our knowledge this is the 
first learning algorithm for spiking neuron networks with hidden layers where multiple 
spikes are considered in all layers, with precise spike time encoding can be read for 
both inputs and outputs. 
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The rest of the article is organised as follows: in the following section some of the 
existing supervised learning algorithms for spiking neurons are discussed. Section |3] 
contains a description of the generalised spiking neuron model and the derivation of the 
learning rule based on this neuron model for a feed-forward network with a hidden layer. 
In section|4]the weight modifications are analysed for a simplified network with a single 
output neuron. In section[5]the flexibility and power of the feed-forward spiking neural 
networks trained with multilayer ReSuMe are showcased by non-linear problems and 
classifications tasks. The spiking neural network is trained with spike timing patterns 
distributed over timescales in the range of tens to hundreds of milliseconds, comparable 
to the span of sensory and motor processing ( |Mauk & Buonomano||2004| ). A discussion 
of the learning algorithm and the results are presented in section|6] The article concludes 
with a summary of the article. 



2 Background 



Whereas experimental studies have shown that supervised learning is present in the 



brain, especially in sensorimotor networks and sensory systems ( Knudsen 1994 , 2002 ), 
there are no definite conclusions regarding the means through which biological neurons 
learn. Several learning algorithms have been proposed to explore how spiking neurons 
may respond to given instructions. 



One such algorithm, the tempotron learning rule, has been introduced by Giitig & 
Sompolinsky ( |2006 ), where neurons learn to discriminate between spatiotemporal se- 
quences of spike patterns. Although the learning rule uses a gradient-descent approach, 
it can only be applied to single-layered networks. The algorithm is used to train leaky 
integrate-and-fire neurons to distinguish between two classes of patterns by firing at 
least one action potential or by remaining quiescent. While the spiking neural network 
is able to successfully classify the spike-timing patterns, the neurons do not learn to 
respond with precise spike-timing patterns. 



Another gradient descent based learning rule is the SpikeProp algorithm (Bohte 
et al.[ |2002[ ) and its extensions ( [Schrauwen & Van Campenhout[ |2004[ |Xin & En> 
brechtsi [20011 [Booij & tat Nguyenj [2005| |Tino & MiUsj [2005] ). The algorithm is ap- 
plied to feed-forward networks of neurons firing a single spike and is minimizing the 
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time difference between the target spike and the actual output spike time. The learn- 



ing algorithm uses spiking neurons modelled by the Spike Response Model (Gerstner 



2001 ), and the derivations of the learning rule are based on the explicit dynamics of the 



neuron model. Although [Booij & tat Ngu}^ (2005) have extended the algorithm to 
allow neurons to fire multiple spikes in the input and hidden layers, subsequent spikes 
in the output layer are still ignored because the network error is represented by the time 
difference of the first spike of the target and output neurons. 

Some supervised learning algorithms are based on Hebb's postulate - "cells that fire 



together, wire together"" (Hebb 1949) - (Ruf & Schmitt 1997; Legenstein etalj 2005 



Ponulak & Kasihski[ |20T0j). ReSuMe ( [Ponulak & Kasihskij [lOTOl ) is making use of 



both Hebbian learning and gradient descent techniques. As the weight modifications 
are based only on the input and output spike trains and do not make any explicit as- 
sumptions about the neural or synaptic dependencies, the algorithm can be applied to 
various neuron models. However, the algorithm can only be applied to a single layer of 
neurons or used to train readouts for reservoir networks. 

The ReSuMe algorithm has also been applied to neural networks with a hidden 



layer, where weights of downstream neurons are subject to multiplicative scaling (Griining 
& Sporeal|2011 1. The simulations show that networks with one hidden layer can per- 



form non-linear logical operations, while networks without hidden layers cannot. The 
ReSuMe algorithm has also been used to train the output layer in a layer-feedforward 



network in Glackin et al. (201 1 ) and Wade et al. (2010 1 where the hidden layer acted as 
a frequency filter. However, input and target outputs here consisted of fixed-rate spike 
trains. 

Finally, there is a linear programming approach to estimating weights (and delays) 
in recurrent spiking networks based on LIF neurons ( jRostro-Gonzalez et al. , [2010 ) 
which successfully attempts to reconstruct weights from a spike raster and initial con- 
ditions such as a (fixed) input spike train and initial states of membrane potentials. 

The present paper introduces a new supervised learning algorithm that combines 



the quality of SpikeProp, spanning to multiple layers (Bohte et al. 2002), with the 
flexibility of ReSuMe, which can be used with multiple spikes and different neuron 



models (Ponulak &Kasihski 2010). 



4 



3 Learning algorithm 



In this section the new learning algorithm for feed-forward multilayer spiking neural 
networks is described. The learning rule is derived for networks with only one hidden 
layer, as the algorithm can be extended to networks with more hidden layers similarly. 



3.1 Neuron model 



The input and output signals of spiking neurons are represented by the timing of spikes. 
A spike train is defined as a sequence of impulses fired by a particular neuron at times 



t^. Spike trains are formalised by a sum of Dirac delta functions (Gerstner & Kistler 



2002): 



s{t) = j2^it-t^) (1) 

In order to analyse the relation between the input and output spike trains, we use the 



linear Poisson neuron model (Giitig et al. 2003 , Kempter et al. , 2001 1. Its instantaneous 
firing rate R{t) is formally defined as the expectation of the spike train, averaged over 
an infinite number of trials. An estimate of the instantaneous firing rate can be obtained 



by averaging over a finite number of trials (Heeger, 2001 1: 



1 ^ 

R{t) =< Sit) >=^5^5,(t) 



(2) 



where M is the number of trials and Sj{t) is the concrete spike train for each trial. In 
the linear Poisson model the spiking activity of the postsynaptic neuron o is defined by 
the instantaneous rate function: 



(3) 



where n is the number of presynaptic neurons h. The weights Woh represent the strength 
of the connection between the presynaptic neurons h and postsynaptic neuron o. The 
instantaneous firing rate R{t) will be used for the derivation of the learning algorithm 
due to its smoothness and subsequently be replaced by its discontinuous estimate, the 
spike train S{t). 
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3.2 Backpropagation of the network error 

The learning algorithm is derived for a fully connected feed-forward network with one 
hidden layer. The input layer / is only setting the input patterns without performing 
any computation on the patterns. The hidden and output layers are labelled H and O 
respectively. All neurons in one layer are connected to all neurons in the subsequent 
layer. 

The instantaneous network error is formally defined in terms of the difference be- 
tween the actual instantaneous firing rate -Ro(t) and the target instantaneous firing rate 
Ri{t) for all output neurons: 

E{t) = E{Ri{t)) = \ J2 Wt) - Riit)T (4) 

oeo 

In order to minimise the network error, the weights are modified using a process of 
gradient descent: 

Awohit) = -V ^ °^ (5) 

OWoh 

where r] is the learning rate and Woh represents the weight between the output neuron o 
and hidden neuron h. Awohit) is the weight change contribution due to the error E{t) 
at time t, and the total weight change is Aw = J Aw{t)dt over the duration of the spike 
train. This is analogue to the starting point of standard backpropagation for rate neurons 
in discrete time. For simplicity, the learning rate will be considered r] = 1 and will be 
suppressed in all following equations, as the step length of each learning iteration will 
be given by other learning parameters to be defined later on. Also, in the following 
derivatives are understood in a functional sense. 

Weight modifications for the output neurons 

In this section we re-derive the weight-update formulate for the ReSuMe learning algo- 
rithm and connect with gradient-descent learning for linear Poisson-neurons. We will 
need this derivation as a first step to a derive our extension of ReSuMe to subsequent 
layers in the next subsection. However, this derivation is also instructive in its own 



right as it works out a bit more rigorously than in the original derivation (Ponulak & 



Kasinski 2010) how ReSuMe and gradient descent are connected. It also makes Ponu- 



lak's statement more explicit that ReSuMe can be apphed to any neuron model. This is 
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then the case if the neural model can on an appropriate time scale be approximated well 
enough with a linear neuron model. 

As the network error is a function of the output spike train, which in turn depends 
on the weight Woh, the derivative of the error function can be expanded using the chain 
rule as follows: 

dmiit)) ^ dE{R:{t)) dRl{t) 
dwoh dRt{t) dwoh 

The first term of the right-hand part of equation Q can be calculated as: 

dE{R'^{t)) ^ 
dRiit) -^o{t)-Ro{t) (7) 

Since the instantaneous rate function is expressed in terms of the weight Woh in (|3]), the 
second factor of the right-hand side of equation (|6]) becomes: 

^ ^R.(t) (8) 

where Uh is the number of hidden neurons. By combining equations (|5]) - ([8]), the 
formula for weight modifications to the output neurons becomes: 

^wohit) = -- [Rlit) - Rtit)] Ru(t) (9) 

For convenience we define the backpropagated error 5o{t) for the output neuron o: 

5,{t) := - [Riit) - Rl{t)\ (10) 

hence: 

^Wolit) = bo{t)Rh{t) (11) 

This is similar to standard discrete-time backpropagation, however now derived as a 
functional derivative in continuous time. In the following we will use the best estimation 
of the unknown instantaneous firing rate R{t) when we only have a single spike train 
S'(t), which is the spike train itself for each of the neurons involved. Thus the weights 
will be modified according to: 

Awo,(t) = - K(t) - Sl{t)\ Shit) (12) 

However, products of Dirac 5 functions are mathematically problematic. Following 



Ponulak & Kasihski (2010) the non-linear product of S^{t)Sh{t) is substituted with a 



STDP process. In a similar manner, S^{t)Sh{t) is substituted with an anti-STDP process 



(for details see |Ponulak & KasinskIl ( |2010D ). 



a+ I aP'%s)S^{t- s)ds 



oo 

post 



a + I a'' 





{s)Sh{t — s)ds 



(13) 



s:it)Sf,it) -> - Shit) 



a+ I aP''^{s)S^{t- s)ds 



POO 

a+ aP'''\s)Shit-s)ds 
Jo 



(14) 



where a > is a non-Hebbian term that guarantees the weight changes in the correct 
direction if the output spike train contains more or less spikes than the target spike train. 

The integration variable s represents the time difference between the actual firing 
time of the output neuron and the firing time of the hidden neuron s = (tl — tl), and 
the target firing time and the firing time of the hidden neuron s = (t^ — tl) respectively. 
The kernel a^^'^(s) gives the weight change if the presynaptic spike (the spike of the 
hidden neuron occurs) comes after the postsynaptic spike (the spikes of the output and 
target neurons). The kernel aP°**(s) gives the weight change if the presynaptic spike 
before the postsynaptic spike. The kernels a^"^^ and a^°''* define the learning window 



W{s) (Gerstner&Kistler, 2002) 



-.pre/ 



W{s) 



..post I 



s) = -A_exp(^), ifs<0 
s) = +A+ exp(— ), if s > 



(15) 



where A^, A_ > are the amplitudes and t^, > are the time constants of the 
learning window. Thus the final learning formula for the weight modifications becomes: 



^Wohit) = — Shit) 
rih 



(16) 



The total weight change is obtained by integrating equation ( fT6| ) over time on a time 
domain that covers all the spikes in the system. This equation is the core of ReSuMe 



learning algorithm as stated in Ponulak & Kasihski (2010). 
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Weight modifications for the hidden neurons 

In this section we extend the argument above to weight changes between the input and 
the hidden layer. The weight modifications for the hidden neurons are calculated in a 
similar manner in the negative gradient direction: 

. (,7) 

OWhi 

The derivative of the error is expanded similarly as in equation (|6]) (again in the sense 
of functional derivatives): 

dE{R^,it)) _ dEjR-it)) dR,it) ^^g^ 
dwhi dRh(t) dwhi 

The first factor of the right-hand part of the above equation is expanded for each output 

neuron using the chain rule: 

dE{R-it)) ^ dE{Rl{t)) dRl{t) 



dR^it) ^ dR^M dRu{t) ^ 

The second factor of the right-hand side of the above equation is calculated from equa- 
tion ([3]): 

dRh{t) Uh 

The derivatives of the error with respect to the output spike train have already been 
calculated for the weights to the output neurons in equation (|7]). By combining these 
results: 



The second factor of the right-hand part of equation ( [T8| ) is calculated as follows using 
again equation ([3]): 

^ '-P.it) (22) 

where rii is the number of input neurons. By combining equations ( [T7| ) - ( [22] ), the 
formula for the weight modifications to the hidden neurons becomes: 

Awhiit) = [Kit) - Rtit)] R^{t)woh (23) 

We define the backpropagated error Shii) for layers other than the output layer: 

Shit) ■■= -y^So{t)woh (24) 



Just like in standard backpropagation 5o{t) are backpropagated errors of the neurons 
in the preceding layer. By substituting the instantaneous firing rates with the spike trains 
as estimators, equation ( [23] ) becomes: 

^Whiit) = — Yl i^oit) - S:it)] S,it)woh (25) 



rihUi 



o£0 



We now want to repeat the procedure of replacing the product of two spike trains (in- 



volving (ie/ta-distributions) with a STDP process. We note first that equation p5] ) does 
not depend any longer on any spikes fired or not fired in the hidden layer. While there 
are neurobiological plasticity processes that can convey information about a transmitted 
spike from the effected synapses to lateral or downstream synapses (for an overview see 
Harris||2008 1, no direct neurobiological basis is known for an STDP process between a 



synapse and the outgoing spikes of an upstream neuron. Therefore this substitution is 
to be seen as a computational analogy, and the weights will be modified according to: 



1 r°° 

n^n, ^ [Jo 

1 r 

+ y [St{t) - 5:(t)l a + / aP"'\s)Si.{t - s)ds 

^i^h ^ ' I Jo 



(26) 



Woh 



The total weight change is again determined by integrating equation ( [26| ) over time. 
The synaptic weights between the input and hidden neurons are modified according to 
STDP processes between the input and target spikes and anti-STDP processes between 
input and output spikes. 



Normalisation 

The normalisation to the number of presynaptic connections of the modifications of the 
weights to the output neurons ensures that the changes are proportional to the num- 
ber of weights. Moreover, the learning parameters do not need to change as the net- 
work architecture changes (for example, in order to keep the firing rate of postsynaptic 
neurons constant as the number of presynaptic units changes, the initial weights and 
weight modifications also must change accordingly). The normalisation to the number 
of presynaptic and postsynaptic connections of the weight modifications to the hidden 
neurons ensures that the changes of the connections between the input and hidden layer 
are usually smaller than the changes of the connections between the hidden and output 
layer, which keeps the learning process stable. 
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Generalisation 



The algorithm can be generalised in this manner for neural networks with multiple hid- 
den layers. The learning rule could also be generalised for recurrent connections (e.g. 
using unrolling in time as in backpropagation through time ( Rojas[ 1996)), however in 
the present paper we only consider feed-forward connections. This is our extension of 
ReSuMe to hidden layers following from error minimisation and gradient descent. 

As the learning rule for the weight modifications depends only on the presynaptic 
and postsynaptic spike trains and the current strength of the connections between the 
spiking neurons, the algorithm can be applied to various spiking neuron models, as 
long as the model can be sufficiently well approximated on an appropriate time scale 
as in equation (|2]). Although Ponulak & Kasiriski (2010) do not explicitly use any 
neuron model for the derivation of the ReSuMe algorithm, implicitly a linear neuron 
model is assumed. The algorithm has successfully been applied to leaky integrate- and- 



fire neurons, Hodgkin-Huxley, and Izhikevich neuron models (Ponulak & Kasinski 



2010). Since the present learning rule is an extension of ReSuMe to neural networks 
with multiple layers, this is an indication that this algorithm will function with similar 
neuron models, as we demonstrate in the following section. 



Inhibitory connections 

Inhibitory connections are represented by negative weights which are updated in the 
same manner as positive weights. However, for the calculation of the backpropagation 



error of the hidden neurons 6h{t) in equation ( [3T] ), the absolute value of the output 
weights will be used. This is a deviation from the gradient descent rule, but using the 
absolute values guarantees that the weights between the input and hidden neurons are 
always modified in the same direction as between hidden and output neurons: 



1 r r°° 



Woh 



oo 

a+ I aP°'\s)Si{t- s)ds 





(27) 



\Woh\- 



Preliminary simulations have shown this results in better convergence of the learning 
algorithm. There is also neurobiological evidence that LTD and LTP spread to down- 
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stream synapses ( |Tao et al^ |2000^ |Fitzsimonds et al4 11997[ ), i.e. that weight changes 
with the same direction propagation from upstream to downstream neurons. 



Delayed sub-connections 

If one considers a network architecture where all the neurons in one layer are connected 
to all neurons in the subsequent layer through multiple sub-connections with different 



delays d^, where each sub-connection has a different weight (jBohte et al. 2002), the 



learning rule for the weight modifications for the output neurons will become: 

A7/;„\ = 5„(t)i?,(t-d„\) (28) 

where is the weight between output neuron o and hidden neuron h delayed by d^^^ 
ms. The backpropagated error for the output is then: 

m = — [Kit) -Kit)] (29) 

mrih 

where m is the number of sub-connections. The learning rule for the weight modifica- 
tions for any hidden layer is derived similarly as: 

^wl = 5Hmiit-dl^) (30) 

where 5h{t) is the backpropagated error calculated over all possible backward paths 
(from all output neurons through all delayed sub-connections): 

^hit) = — E ^^'"'oh (31) 

The algorithm can be generalised for neural networks with multiple hidden layers 
and delays similarly. 

3.3 Synaptic scaling 

There has been extensive evidence that suggests that spike-timing dependent plasticity 



is not the only form of plasticity (Watt & Desai, 2010). Another plasticity mechanism 



used to stabilise the neurons activity is synaptic scaling ( jShepard et al.[ 12009] ). Synaptic 



scaling regulates the strength of synapses in order to keep the neuron's firing rate within 
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a particular range. The synaptic weights are scaled multiplicatively, this way maintain- 
ing the relative differences in strength between any inputs (Watt & Desai, 2010[ ). 



In our network, in addition to the learning rule described above, the weights are also 
modified according to synaptic scaling in order to keep the postsynaptic neuron firing 
rate within an optimal range [rmin,rmax]- If a weight Wij from neuron i to neuron j 
causes the postsynaptic neuron to fire with a rate outside the optimal range, the weights 
are scaled according to the following formula ( [Gruning & Sporeal|2011| ): 



(1 + f)wij,Wij>0 
Wij = <^ (32) 

j^Wij,Wij < 

where the scaling factor / > for rj < rmin, and / < for rj > r^ax- 

Synaptic scaling solves the problem of optimal weight initialisation. It was observed 
that the initial values of the weights have a significant influence on the learning process, 
as too large or too low values may result in failure of the learning ( |Bohte et"aL , 2002 1. 



Preliminary experiments show that a feed-forward network can still learn reliably sim- 
ple spike trains without synaptic scaling as long as the weights are initialised within an 
optimal range. However, as the target patterns contain more spikes, finding the optimal 
initial values for the weights becomes difficult. Moreover, as the firing rate of the target 
neurons increases, it becomes harder to maintain the output neurons firing rate within 
the target range without using minimal learning steps. The introduction of synaptic 
scaling solves the problem of weights initialisation as well as speeds up the learning 
process. 



4 Heuristic discussion of the learning rule 

In order to analyse the direction in which the weights change during the learning process 



using equations ( [T6| ) and ( [27| ), we will consider a simple three layer network. The 
output layer consists of a single neuron. The neurons are connected through a single 
sub-connection with no delay. For clarity, in this section spike trains will comprise only 
a single spike. Let and ta denote the desired and actual spike time of output neuron 
o, and th and ti the spikes times of the hidden neuron h and input neuron i respectively. 
Also, for simplicity, synaptic scaling will not be considered here. 
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For a start we assume to, td > th, U, i.e. where relevant post synaptic spikes occur 



after the pre-synaptic spikes. With these assumptions ([16]) and ( [27] ) read after integrating 
out: 

^Woh = — ( A+ exp — — — - exp — — —] , (33) 

Awhi = \woh\(A+exp- — —-A^exp- — —] . (34) 

We discuss this case only in the following and note the case to,td < th,ti (i.e. 
post-before-pre) can be discussed along the same lines with above replaced by A_. 
We discussed the following cases: 

1. The output neuron fires a spike at time to before the target firing time td (to < td). 

(a) Weight modifications for the synapses between the output and hidden neu- 
rons. The weights are modified according to /^Woh = ^{A+exp ^i^-^ — 
A+exp^^^). Since to < td then exp(^^^) > exp (^^^) in equation 



( |33| ). This results in Awoh < 0, and thus in a decrease of this weight. If 
the connection is an excitatory one, the connection becomes less excitatory, 
increasing the likelihood, that the output neuron fires later during the next 
iteration, hence minimising the difference between the actual output and the 
target firing time. If the connection is inhibitory, the connection will become 
stronger inhibitory, resulting in a later firing of the output neuron o as well 
(see also |Ponulak1 ( |2006l )). 



(b) Weight modifications for the synapses between the hidden and input neu- 
rons. The weights to the hidden neurons are modified according to: Aw hi = 

_J_(A+exp *^ - A+exp '-^)\woh\. 

(i) Woh > 0. By an analogue reasoning to the case above Aw hi, and hence 
the connection will become less excitatory and more inhibitory, again 
making the hidden neuron fire a bit later, and hence making it more 
likely that also the output neuron fires later as the connection from hid- 
den to output layer is excitatory. 

(ii) Woh < 0. For the hidden neuron the effect stays the same, hence it will 
fire later. As it is now more likely to fire later, its inhibitory effect will 
come to bear on the output neurons also a bit later. 
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2. The output neuron fires a spike at time to after the target firing time td (to > td). 



As p3\ and ( [34] ) change their sign when to and t^ are swapped, this case reduces 
to the above, but with the opposite sign of the weight change, i.e. overall weight 
change such that to moves forward in time, close to td- 

Cases where there is only an actual spike at to and no desired spike or where there 
is only a desired spike at td can be dealt with under the above cases if one sets td = oo 
or to = oo respectively. In addition there will be a contribution from the factor a in 



equations ([17]) and (27 ), and this has the same sign as the one from (33 ) and (34). 



5 Simulations 

In this section several experiments are presented to illustrate the learning capabilities 
of the algorithm. The algorithm is applied to classic benchmarks, the XOR problem 
and the Iris data set, as well as to classification tasks with randomly generated patterns. 
The XOR problem is applied using two different encoding methods to demonstrate the 
flexibility of our learning algorithm. The learning rule is also applied to classification 
problems of spike timing patterns which range from 100 ms to 500 ms in order to 
simulate sensory and motor processing in biological systems. 



Setup 

The network used for the following simulations is a feed- forward architecture with three 
layers. The neurons are described by the Spike Response Model (Gerstner 200 1| ) (see 
the appendix for a complete description). 

For all simulations, an iteration consists of presenting all spike timing pattern pairs 
in random order. The membrane potential of all neurons in the hidden and output layers 
is set to the resting potential (set to zero) when presenting a new input pattern. Af- 
ter each presentation of the input pattern to the network, the weight modifications are 
computed for all layers and then applied. We apply the weight changes after the back- 
propagated error is computed for all units in the network. The summed network error 
is calculated for all patterns and tested against a required minimum value, depending 
on the experiment (see the appendix for details on the network error). This minimum 
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value is chosen in order to guarantee that the network has learnt to correctly classify all 
the patterns with an acceptable precision. 

The results are averaged over a large number of trials (50 trials unless stated other- 
wise), with the network being initialised with a new set of random weights every trial. 
On each testing trial the learning algorithm is applied for a maximum of 2000 iterations 
or until the network error has reached the minimum value. 

The learning is considered converged if the network error has reached a minimum 
value, depending on the experiment. Additional constrains for the convergence of the 
learning algorithm are considered in Sections 5.3 to 5.5 in order to ensure the network 
has learnt to correctly classify all the patterns. For all simulations, the averaged number 
of iterations needed for convergence is calculated over the successful trials. The accu- 
racy rate is defined as the percentage of correctly classified patterns calculated over the 
successful trials. 

Unless stated otherwise, the network parameters used in these simulations are: the 
threshold i!) = 0.7, the time constant of the spike response function r = 7 ms, the time 
constant of after-potential kernel r,. = 12 ms. The scaling factor is set to / = ±0.005. 
The learning parameters are initialised as follows: = 1.2, A_ = 0.5, r+ = r_ = 5 
ms, a = 0.05. 

The weights were initialised with random values uniformly distributed between - 
0.2 and 0.8. The weights are then normalised by dividing them to the total number of 
sub-connections. 



5.1 The XOR benchmark 

In order to demonstrate and analyse the new learning rule, the algorithm is applied to 
the XOR problem. While this benchmark does not require generalising, the XOR logic 
gate is a non-linear problem and it is a classical benchmark for testing the learning 



algorithm's ability to train non-trivial input output transformations (Rojas, 1996) 



Technical details 



The input and output patterns are encoded using spike-time patterns as in Bohte et al. 



(2002). The signals are associated with single spikes as follows: a binary symbol "0" 
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is associated with a late firing (a spike at 6 ms for the input pattern) and a "1" is asso- 
ciated with an early firing (a spike at ms for the input pattern). We also used a third 
input neuron that designates the reference start time as this encoding needs an absolute 
reference start time to determine the latency of the firing ( Sporea & Griiningf 2011 1. 
Without a reference start time, two of the input patterns become identical and without 
an absolute reference time, the network is unable to distinguish the two patterns (0-0 
and 6-6) and would always respond with a delayed output. Table [T] shows the input and 
target spike timing patterns that are presented to the network. The values represent the 
times of the spikes for each input and target neuron in ms of simulated time. 



Table 1: Input and output spike-time patterns. The patterns consists of the timing of 
single spikes in ms of simulated time for the input and target neurons. 



Input [ms] 


Output [ms] 











16 





6 





10 


6 








10 


6 


6 
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The learning algorithm was applied to a feed-forward network as described above. 
The input layer is composed of three neurons, the hidden layer contains five spiking 
neurons, and the output layer contains only one neuron. Multiple sub-connections with 
different delays were used for each connection in the spiking neural network. Prelimi- 
nary experiments showed that 12 sub-connections with delays from ms to 11 ms are 
sufficient to learn the XOR problem. The results are averaged over 100 trials. The 
network error is summed over all pattern pairs, with a minimum value for convergence 
of 0.2. The minimum value is chosen to ensure that the network has learnt to classify 
all patterns correctly, by matching the exact number of spikes of the target spike train 
as well as the timing of the spikes with 1 ms precision. Each spiking neuron in the 
network was simulated for a time window of 30 ms, with a time step of 0.1 ms. In the 
following we systematically vary the parameters of the learning algorithm and examine 
their effects. 
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The learning parameters 



Here, we vary the learning parameters and A_ in equation ( fT5] ) in order to determine 
the most appropriate values. is varied between 0.5 and 2.0, while keeping A_ = 
\Aj^. Table|2^ shows the summarised results. 



Table 2: Summarised results for the XOR problem: (a) The parameters A^ and A^ are 
varied in order to determine the best values for faster convergence. The ratio between 
these parameters is constant A^ = 2A_. (b) While keeping A^ = 1.2 fixed, A_ is 
varied in order to determine the best ratio between these parameters. 



(a) 



(b) 





Successful 
trials [%] 


Average number 
of iterations 


0.5 


97 


331 ± 46 


0.6 


98 


232 ± 24 


0.7 


95 


262 ± 38 


0.8 


97 


144 ± 35 


0.9 


96 


184 ± 23 


1.0 


96 


204 ± 34 


1.1 


92 


166 ± 27 


1.2 


96 


207 ±31 


1.3 


95 


174 ± 30 


1.4 


97 


183 ± 28 


1.5 


93 


204 ± 36 


1.6 


93 


273 ± 43 


1.7 


96 


163 ± 27 


1.8 


94 


181 ±32 


1.9 


95 


221 ± 32 


2.0 


89 


141 ± 18 





Successful 
trials [%] 


Average number 
of iterations 


0.00 


97 


231 ± 30 


0.10 


98 


196 ± 24 


0.20 


96 


157 ± 16 


0.30 


96 


187 ±28 


0.40 


95 


204 ± 37 


0.50 


98 


137 ± 16 


0.60 


96 


207 ±31 


0.70 


95 


191 ±33 


0.80 


98 


185 ±31 


0.90 


86 


203 ±31 


1.00 


88 


200 ± 30 


1.10 


80 


257 ±33 


1.20 


70 


349 ± 42 


1.30 


65 


382 ± 30 


1.40 


45 


353 ± 28 


1.50 


56 


492 ± 32 



The parameters A^ and A^ play the role of a learning rate. Just like the classic 
back-propagation algorithm for rate neurons, when the learning parameters have higher 
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values the number of iterations needed for convergence is lower. In order to determine 
the best ratio between the two learning parameters, various values are chosen for A_, 
while keeping = 1.2 fixed. The results are summarised in Table[2j5. 

The learning algorithm is able to converge for the values of A_ lower than A^. As 
A^ becomes equal or higher than A+, the convergence rate slowly decreases and the 
number of iterations needed for convergence significantly rises. The lowest average 
number of iterations with a high convergence rate is 137 averaged over 98% successful 
trials. 

Number of sub-connections 

The algorithm also converges when the spiking neural network has a smaller number 
of sub-connections. However, a lower number of delayed sub-connections results in a 
lower convergence rate without necessarily a lower average of learning iterations for the 
successful trials. Although more sub-connections can produce a more stable learning 
process, due to the larger number of weights that need to be coordinated, the learning 
process is slower in this case. Table |3] shows the summarised results, where A^ = 1.2 
and A_ = 0.6. 

Table 3: The number of delayed sub-connection is varied while keeping the learning 
parameters fixed A+ = 1.2 and A^ = 0.6. 



Sub- 


Successful 


Average number 


connections 


trials [%] 


of iterations 


4 


11 


63 ±20 


6 


24 


169 ± 37 


8 


73 


192 ±27 


10 


81 


154 ± 17 


12 


96 


207 ±31 


14 


96 


309 ± 52 


16 


73 


472 ± 56 
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Analysis of learning process 

In order to analyse the learning process, the network error and the weight vector dur- 
ing the learning process can be seen in Figure [T] (y4+ = 1.2, A_ = 0.6, and 12 sub- 
connections). Figure [T^ shows the evolution of the summed network error during learn- 
ing. Figure [IJ5 shows the Euclidean distance between the weight vector solution found 
on a trial and the weight vectors during each learning iteration that led to this weight 
vector. The weight vectors are tested against the solution found because there can be 
multiple weight vectors solutions. While the error graph is irregular, the weight vector 
graph shows that the weight vector moves steadily towards the solution. The irregular- 
ity of the network error during the learning process can be explained by the fact that 
small changes to the weights can produce an additional or missing output spike, which 
causes significant changes in the network error. The highest error value corresponds to 
the network not firing any spike for any of the four input patterns. The error graph also 
shows the learning rule ability to modify the weights in order to produce the correct 
number of output spikes. 




Iterations Iterations 



(a) (b) 

Figure 1: Analysis of the learning process with the parameters = 1.2 and A_ = 0.5: 
(a) The network error during learning, (b) The Euclidean distance between the weight 
vector solution and the weight vectors during the learning process. 
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5.2 The Iris benchmark 



Another classic benchmark of pattern recognition is Fisher's Iris flower data set (jFisher 



1936). The data set contains three classes of Iris flowers. While one of the classes is 



linearly separable from the other two, the other two classes are not linearly separable 
from each other. 



Technical details 

The three species are completely described by four measurements of the plants: the 
lengths and wights of the petal and sepal. Each measurement has associated an input 
neuron and the input pattern consists of the timing of a single spike. The measurements 
of the Iris flower range from to 8 and are fed into the spiking neural network as spike 
timing patterns to the input neurons. The output of the network is represented by the 
spike-time of the output neuron, as seen in Table |4} The hidden layer contains ten 
spiking neurons and each connection has between 8 and 12 delayed sub-connections 
depending on the experiment. 

Table 4: The target neuron's spike train contains a single spike, where the timing (shown 
in ms) differs for each of the three patterns. 



Species 


Output spike-time [ms] 


I. setosa 


10 


I. versicolor 


14 


I. virginica 
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During each trial, the input patterns are randomly divided into a training set (75% of 
samples) and a testing set (25% of samples) for cross validation. During each iteration, 
the training set is used for the learning process to calculate the weight modifications 
and to test if the network has learnt the patterns. The learning is considered successful 
if the network error has reach a minimum average value of 0.2 for each pattern pair 
and 95% of the patterns in the training set are correctly classified. As in the previous 
experiment, this minimum value is chosen to ensures that the network has learnt to 
classify all patterns correctly, by matching the exact number of spikes of the target spike 
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train as well as timing of the spikes with 1 ms precision. Table [5] shows the summarised 
results on the Iris data set for different network architectures with different numbers of 
delayed sub-connections. 



Table 5: Summarised results for the Iris data set. 



Sub- 


Successful 


Average number 


Accuracy on the 


Accuracy on the 


connections 


trials 


of iterations 


training set [%] 


testing set [%] 


8 


68 


125 ± 12 


97 ±0.17 


89 ±0.69 


9 


80 


174 ± 16 


96 ±0.00 


94 ± 0.79 


10 


80 


114 ± 13 


97 ±0.00 


89 ±0.47 


11 


74 


140 ± 15 


96 ±0.16 


86 ± 0.49 


12 


68 


183 ± 21 


96 ±0.17 


91 ±0.69 



Multi-layer ReSuMe permits the spiking neural network to learn the Iris data set 
using a straight forward encoding of the patterns and results in much faster learning than 
SpikeProp, as the average number of iterations is always lower than 200, as opposed to 
the population coding based on arrays of receptive fields that requires 1000 iterations 



for learning (Bohteetal.j 2002). 



5.3 Non-linear spike train pattern classification 

In this experiment the learning algorithm is tested on non-linear transformation of se- 
quences of spikes. Again, the XOR problem is applied to a network of spiking neurons, 
but the logic patterns are encoded by spike trains over a group of neurons, and not single 
spikes (see also |Gruning & Sporea| ( |2011"] )). 



While the encoding for the XOR logic gate problem introduced by Bohte et al 



( 2002[ ) requires neurons to fire a single spike, the network of spiking neurons needs a 



large number of sub-connections with different delays to enable the hidden and output 
neurons to fire at the desired times. As the problem becomes more complex such encod- 
ing might need even more sub-connections which have to be trained. The large number 
of weights to be trained slows down the learning process because of the large number 
of incoming spikes that need to be coordinated to produce the requires output. This can 
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also be seen in the previous simulations on the XOR problem where the network with 
14 terminals, although learning the patterns it needed almost twice as many iterations to 
converge as the network with 12 terminals. Moreover, it has been shown that encoding 
logical true and false with early and late spike times respectively also requires an addi- 
tional input neuron to designate the reference start time. Without the additional input 
neuron, even linear problems become impossible to solve (for a complete demonstra- 
tion, see |Sporea & Gruniligl ( |2011| )). 

A more natural encoding would consist of the temporal firing patterns of groups 
of neurons ( |Wehr & Laurent[ |1996[ |Neuenschwander & Singer[ |1996^ |deCharms & 



Merzenich 1996 1. In order to test such an encoding and the learning algorithm's ability 
to learn non-linear patterns, the XOR problem is applied once again to a spiking neural 
network. In this experiment the two logical values will be encoded with spike trains 
over two groups of input neurons. This encoding will not necessitate multiple delays 
nor the additional input neuron. In all the following experiments, a single connection 
with no delay will be used. 

Technical details 

Each input logical value is associated with the spike trains of a group of 20 spiking 
neurons. In order to ensure some dissimilarity between the patterns, for each input 
neuron a spike train is generated by a pseudo Poisson process with a constant firing rate 
of r = 0.06 within a 30 ms time window. The minimum inter spike interval is set to 
3 ms. This spike train is then split in two new spike trains by randomly distributing 
all the spikes ( |Gruning & Sporeaj 2011 1. The newly created spike trains will represent 



the patterns for the logical symbols "0" and "1". The input spike trains are required to 
consist of at least one spike. 

The output patterns are created similarly and will be produced by one output neuron. 
The spike train to be split is generated by a pseudo Poisson process with a constant firing 
rate of r = 0.2 within a 30 ms period of time. The resulting output patterns are chosen 
so that the spike trains contain exactly three spike. 

Apart from the minimal network error as before, an additional stopping criterion for 
the learning process is introduced. The network must correctly classify all four patterns. 
An input pattern is considered correctly classified if the output spike train is closest to 
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the target pattern in terms of the van Rossum distance. The network error consist of the 
sum of van Rossum distances between the target and actual output over the four patterns 
as before; a minimum value of 3 ensures that the output spikes are reproduced with an 
acceptable precision. 

In addition to the previous experiments, an absolute refractory period is set for all 
neurons to t = 3 ms. The learning is simulated over a period of 50 ms, with a time step 
of 0.5 ms. 

In order to determine the optimal size of the hidden layer for a higher convergence 
rate, different network topologies have been considered. Table|6]shows the convergence 
rate for each network topology, with a new set of spike-timing patterns being generated 
every trial. 

Table 6: Summarised results for the non-linear classifications task. 



Hidden 


Successful 


Average number 


neurons 


trials [%] 


of iterations 


50 


70 


293 ± 59 


60 


54 


301 ± 66 


70 


56 


327 ±91 


80 


60 


469 ± 87 


90 


76 


247 ± 42 


100 


76 


439 ± 73 



The learning rule is able to converge with a higher rate as the number of neurons in 
the hidden layer increases; a larger hidden layer means that the patterns are distributed 
over a wider spiking activity and easier to be classified by the output neuron. A smaller 
number of neurons in the hidden layer than in the input layer does not result in high 
convergence rate because the input patterns are not sufficiently distributed in the hidden 
activity. Also, more than 100 units in the hidden layer does not result in higher conver- 
gence rates, but as the number of weights also increases the learning process is slower. 
Previous simulations ( Griining & Sporeaf 2011 1 show that a neural network without a 
hidden layer cannot learn non-linear logical operations. 
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5.4 Learning sequences of temporal patterns 

In this experiment, we consider the learning algorithm's ability to train a spiking neural 
network with multiple input-target pattern pairs. The network is trained with random 
non-noisy spike train patterns and tested against noisy versions of the temporal patterns. 



Technical details 

The input patterns are generated by a pseudo Poisson process with a constant firing 
rate of r = 0.05 within a 100 ms period of time, where the spike trains are chosen 
so that they contain at least one spike. In order to ensure that a solution exists, the 
target patterns are generated as the output of a spiking neural network initialised with 
a random set of weights. The target spike trains are chosen so they contain at least two 
spikes and no more than four spikes. If the output patterns were random spike trains. 



a solution might not be representable in the weight space of the network (Legenstein 



etal. 2005). 



The learning is considered to have converged if the network error reaches an average 
value of 0.5 for each pattern pair. Apart from the minimum error, the network must also 
correctly classify at least 90% of the pattern pairs, where the patterns are classified 
according the van Rossum distance (see the appendix for details). 



The size of the hidden layer 

In order to determine how the structure of the neural network influences the number of 
patterns that can be learnt, different architectures have been tested. In these simulations, 
100 input neurons are considered in order to have a distributed firing activity for the 
simulated time period. The output layer contains a single neuron as in the previous 
simulations. The size of the hidden layer is varied from 200 to 300 neurons to determine 
the optimal size for storing 10 input-output pattern pairs. The results are summarised 
in Table |7^. The network is able to perform better as the number of hidden neurons 
increases. However, a hidden layer with more 260 neurons does not result in a higher 
convergence rate. 
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Table 7: Summarised results for the classification task, (a) The network is trained with 
10 pattern pairs, where the size of the hidden layer is varied in order to determine the 
best network architecture, (b) A neural network with a hidden layer containing 260 
neurons is trained with different numbers of pattern pairs. 



(a) 



(b) 



Miimhpt* of* 








trials r%l 


rmmVipr nf 

11 UlllU^l VJ1_ 


units 




iterations 


200 


50 


5 ± 0.8 


210 


52 


6 ± 1.2 


220 


78 


5 ± 0.6 


230 


76 


6± 1.1 


240 


80 


5 ±0.6 


250 


74 


7 ±0.8 


260 


90 


5 ±0.7 


270 


88 


4 ±0.5 


280 


80 


7± 2.4 


290 


90 


4 ±0.6 


300 


90 


4 ±0.4 



MiimhpT* of* 


^IlPPPQcFlll 

o Ln.^ c c o a i u. 1 






trials r%l 


nnrnVipr nf 

llUlllL/^l KJl. 


units 




iterations 


5 


100 


7 ± 0.7 


6 


92 


5 ± 0.6 


7 


96 


5 ± 1.2 


8 


92 


8± 1.5 


9 


88 


7 ±0.9 


10 


90 


6 ±0.6 


11 


72 


6 ±0.7 


12 


72 


6 ±0.7 


13 


58 


5 ±0.9 


14 


40 


6 ±0.9 


15 


34 


5± 1.0 



Number of patterns 

The networks architecture that performed best with the lowest number of neurons (260 
neurons in the hidden layer) was trained with different numbers of patterns. The results 
for different number of patterns are summarised in Table [TJ?. The network is able to 
store more patterns, but the convergence rate drops as the number of patterns increases. 
Because the target patterns are the output spike trains of a randomly initialised spiking 
neural network, as the number of pattern pairs increases, the target spike trains become 
necessarily more similar. Hence, the network's responses to the input patterns become 
more similar and more easily misclassified. Since the stopping criterion requires the 
network to correctly classify the input patterns, the convergence rate drops as the num- 
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ber of pattern pairs increases. 

Since the target patterns are generated as the output spike trains of a network with 
a set of random weights, this vector of weights can be considered the solution of the 
learning process. However when looking at the Euclidean distance between the weight 
vector solution and the weight vectors during learning, the distance is increasing as 
the learning process progresses. The learning algorithm does not find the same weight 
vector as the solution, so multiple solutions of weight vectors to the same problem exist 
(for example permutations of hidden neurons is the simplest one). 



Noise 

After the learning has converged, the networks are also tested against noisy patterns. 
The noisy patterns are generated by moving each spike within a gaussian distribution 
with mean and standard deviation between 1 and 10 ms. After the network has learnt 
all patterns, the network is tested with a random set of 500 noisy patterns. Figure 
shows the accuracy rate (the percentage of input patterns that are correctly classified) 
for the network with 260 spiking neurons in the hidden layer trained with 10 pattern 
pairs. The accuracy rates are similar for all the networks described above. The network 
is able to recognise more than 20% (above the random performance level of 10%) of 
the patterns when these are distorted with 10 ms. 



5.5 Learning to Generalise 

In this experiment, the learning algorithm is tested in the presence of noise. In the 
previous experiments where random patterns were randomly generated, the learning 
occurred in noise free conditions. A spiking neural network is trained to recognise 
temporal patterns on the timescale of hundreds of milliseconds. Jitters of spike times 
are introduced in the temporal patterns during learning to test the network's ability 
to classify time varying patterns. Such experiments have been conducted with liquid 
state machines where readout neurons have been trained with ReSuMe to respond with 



associated spike trains ( [Ponulak & Kasihsla| |2010| ). In this paper, we show that such 
classification tasks can be achieved with feed-forward networks without the need of 
larger networks such as reservoirs. 
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Figure 2: The accuracy on noisy patterns: (a) The network has been trained with 10 non- 
noisy patterns that span over 100 ms. (b) The network has been trained with 3 noisy 
patterns that span over 500 ms. During learning the noisy input patterns are generated 
by moving each spike within a gaussian distribution with mean and standard deviation 
4 ms. 



Technical details 

Three random patterns are fed into the network through 100 input spiking neurons. The 
hidden layer contains 210 neurons and the patterns are classified by a single output 
neuron. The input patterns are generated by a pseudo Poisson process with a constant 
firing rate of r = 0.1 within a 500 ms time period, where the spike trains are chosen 
so that they contain between 15 and 20 spikes. For the spike train generation an inter 
spike interval is set to 5 ms. As in the previous experiment, in order to ensure that 
a solution exists, the target patterns are generated as the output of a spiking neural 
networks initialised with a random set of weights. The target spike trains are chosen so 
that they contain at least five spikes and no more than seven spikes. The input and target 
patterns are distributed over such large periods of time in order to simulate complex 
forms of temporal processing, such as speech recognition, that spans over hundreds of 
milliseconds ( |Mauk & Buonomano[ 12004] ). 

During learning, for each iteration noisy versions of the input patterns are generated 
by moving each spike by a time interval within a gaussian distribution with mean and 
standard deviation varying in the range of 1 to 4 ms. The spikes in the target patterns are 
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also shifted by a time interval within a gaussian distribution with mean and standard 
deviation 1 ms independent of the noise level in the input patterns. 

A minimum average error of 0.6 for each pattern pair is required for the learning to 
be considered successful. During each iteration, the network is tested against a new set 
of 30 random noisy patterns; in order for the learning to be considered converged the 
network must also correctly classify at least 80% of noisy patterns. The spike times of 
the testing patterns are shifted with the same distribution as the training patterns. 

Table [8] shows the convergence rate for each experiment, where the average number 
of iterations is calculated over the successful trials. The table also shows the number of 
successful trials when the network is trained on non-noisy patterns. When the network 
is trained with a low amount of noise in the input patterns, the learning algorithm per- 
forms slightly better than the network trained with patterns without noise. The network 
is able to learn even when the spike train patterns are distorted with 3 or 4 ms. 

Table 8: Summarised results for learning with noisy patterns. The input patterns jitter 
is varied between and 4 ms, while the output jitter is always 1 ms. 



Inputs jitter 


Successful 


Average number 


during learning 


trials [%] 


of iterations 





96 


10 ± 1.2 


1 


98 


12 ± 1.1 


2 


95 


19 ±2.3 


3 


66 


26 ±5.6 


4 


64 


115 ±51 



Figure |2j5 shows the accuracy rates on a trained network against a random set of 150 
different noisy patterns, generated from the three original input patterns. The network 
is trained on input patterns where the spikes are moved within a gaussian distribution 
with mean and variance 4 ms. The graph shows the accuracy rates on patterns with 
the spikes moved within a gaussian distribution with mean and variance between 1 
and 10 ms. The graph also shows the network response on the non-noisy patterns. The 
accuracy rates are similar for all input pattern jitter. The network is able to recognise 



29 



more than 50% (again above the random performance level of 33%) of the input patterns 
even when these are distorted with up to 10 ms. 



6 Discussion 



The multilayer ReSuMe permits training spiking neural networks with hidden layers 
which brings additional computational power. On one hand, the ReSuMe learning rule 
applied on a single layer ( Ponulak & KasihsE] 2010) with 12 to 16 delays for each 
connection is not able to learn the XOR problem with the early and late timing patterns 



(see Section 5.1 ). Although the algorithm is able to change the weights in the correct 
direction, the network never responds with the correct output for all four input patterns. 
The additional hidden layer permits the network to learn the XOR problem (see Section 



5.1 1. On the other hand, a spiking neural network with the same number of units in each 
layer, but with 16 sub-connections trained with SpikeProp on the XOR patterns needs 



250 iterations to converge (Bohte et al. 2002). Our simulations (not presented in this 



paper) with a similar setup to the experiments in 5.1 confirm this result. Furthermore, 
SpikeProp requires 16 delayed sub-connections instead of just 12, hence, also implies 
more weights changes need to be computed. Also, SpikeProp only matches the time 
of the first target spike, ignoring any subsequent spikes; unlike SpikeProp, our learning 
algorithm also matches the exact number of output spikes. 

Moreover, studies on SpikeProp show that the algorithm is unstable affecting the 
performance of the learning process ( [Takase et al.[|2009||Fujita et aL||2008| ). Our learn- 
ing algorithm is based on weight modifications that only depend on the timing of pattern 
pairs and not the specific neuron dynamics, therefore is more stable than SpikeProp (see 
Figure [T]). This can be seen in the direct comparison on the XOR benchmark. Although 
our algorithm also matches the exact number of spikes as well as the precise timing of 
the target pattern, the network learns all the patterns faster. 

The learning algorithm presented here permits using different encoding methods 
with temporal patterns. In section 5^ the Iris data set is encoded using four input 



neurons, instead of 50 neurons required by a population encoding (Bohte et al. , 2002 ). 
The simpler encoding of the Iris flower dimensions allows the network to learn the 
patterns in 5 times less iterations than with a population encoding used with SpikeProp 



30 



dBohteet al.j 12002] ) . 

When moving from rate coded neurons to spiking neurons, an important question 



about the encoding of patterns arises. One encoding was proposed by Bohte et al. 



( 2002[ ), where logical and 1 are associated with the timing of early and late spikes 
respectively. As the input neuron's activity is very sparse, the spikes must be multi- 
plied over the simulated time period, as it is known that ReSuMe performs better with 
more inputs (Ponulak & Kasihski, 2010| ). This is achieved by having multiple sub- 
connections for each input neuron that replicates the action potential with a different 
delay. The additional sub-connections, each with a different synaptic strength, require 
additional training. This encoding also requires an additional input neuron to set the 
reference start time (Sporea & Griining, 2011[ ). Moreover, when looking at the weights 
after the learning process, only some of the delayed sub-connections have a major con- 
tribution to the postsynaptic neuron while others have relatively much smaller absolute 
values. 

The alternative to this encoding is to associate the patterns with spike trains. In 
order to guarantee that a set of weights exist for any random target transformation with- 
out replicating the input signals, a relatively large number of input neurons must be 
considered. As the input pattern is distributed over several spike trains, some of the 
information might be redundant and would not have a major contribution to the output. 
Moreover, such an encoding does not require an additional input neuron to designate 
the reference start time, as the patterns are encoded in the relative timing of the spikes. 



The experiment in section 5.3 shows that this encoding can be successfully used for 
non-linear pattern transformations. 



In the classification task in section 5.4 where the network is trained on 10 spike- 
timing pattern pairs, the learning algorithm converges with a higher rate as the hidden 
layer increases in size. SpikeProp can also be applied to multilayer feed- forward net- 
works but this algorithm is limited to neurons firing a single spike ( [Bohte et"aL . 2002). 

The simulations performed on classification tasks where noise was added to the 
spike-timing patterns show that the learning is robust to the variability of spike timing. 
A spiking neural network trained on non-noisy patterns can recognise more than 50% 
of noisy patterns if the timing of spikes is shifted with a gaussian distribution with 
variance up to 4 ms (see Figure |2^), when the network is trained on noisy patterns, it 
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can recognise more than 50% of noisy patterns where the timing of spikes is moved 
within a gaussian distributions with variance 10 ms (see Figure [2|5). 

Another advantage of the learning rule is the introduction of synaptic scaling. Firstly, 
it solves the problem of finding the optimal range for weight initialisation. This prob- 
lem is acknowledged as critical for the convergence of the learning (Bohte et al. , 2002[ ). 
Secondly, synaptic scaling maintains the firing activity of neurons in the hidden and 
output layer within an optimal range during the learning process. Although the firing 
rate of the output and hidden neurons is also adjusted by the non-correlative term a in 



equations ( [16] ) and ([27]), this is done only when the output firing rate does not match 
exactly the target firing rate. This can cause hidden neurons to become quiescent (neu- 
rons that do not fire any spike) during the learning process and not to contribute to the 
activity of the output neurons. Synaptic scaling eliminates this kind of problems by 
setting a minimum firing rate of one spike. 



7 Conclusion 

This paper introduces a new algorithm for feed-forward spiking neural networks. The 
first supervised learning algorithm for feed-forward spiking neural networks, Spike- 



Prop, only considers the first spike of each neuron ignoring all subsequent spikes (Bo 



hte et al.[ 2002). An extension of SpikeProp allows multiple spikes in the input and 



hidden layer, but not in the output layer ( jBooij & tat Nguyen] |2005[ ). Our learning rule 



is, to the best of our knowledge, the first fully supervised algorithm that considers mul- 
tiple spikes in all layers of the network. Although ReSuMe allows multiple spikes, the 
algorithm can only be apphed to single layer networks or to train readout neurons in 
liquid state machines ( Ponulak & KasihsE] 2010). The computational power added by 



the hidden layer permits the networks to learn non-linear problems and complex classi- 
fication tasks without using a large number of spiking neurons as liquid state machines 
do, or without the need of a large number of input neurons in a two layered network. 
Because the learning rule presented here extends the ReSuMe algorithm to multiple 
layers, it can in principle be applied to any neuron model, as the weight modification 
rules only depend on the input, output and target spike trains and does not depend on 
the specific dynamics of the neuron model. 
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Appendix: Details of simulations 



Neuron model 

The computing units of the feed-forward network used in all simulations are described 



by the Spike Response Model (SRM) (Gerstner[ 2001 1. SRM considers the spiking 



neuron as a homogeneous unit that fires an action potential, or a spike, when the to- 
tal excitation reaches a certain threshold, The neuron is characterised by a single 
variable, the membrane potential, u{t) at time t. 

The emission of an action potential can be described by a threshold process as 
follows. The spike is triggered if the membrane potential u(t) of neuron reaches the 
threshold 1!) at time t-^: 

u{tf)=^ and ^u{tf)>0 (35) 

In the case of a single neuron j receiving input from a set of presynaptic neurons 
i E Tj, the state of the neuron is described as follows: 



iel k 



where yi is the spike response function of the presynaptic neuron i E I, and Wji is the 
weight between neurons i and j; tj is the last firing time of neuron j. The kernel 77 (t) 
includes the form of the action potential as well as the after-potential: 

ri{t) = -^exp I --] (37) 



where r.,, > is the membrane time constant, with r](t) = for t < 0. 

The unweighted contribution of a single synaptic to the membrane potential is given 

by: 

?/'w = EK^~^0 ^^^^ 

/ 

with e{t) is the spike response function with 6{t) = for t < 0. The times t{ represent 
the firing times of neuron i. In our case the spike response function 6{t) describes a 
standard post- synaptic potential: 
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= -exp ( 1 - - ) (39) 



r \ r ^ 

where r > models the membrane potential time constant and determines rise and 
decay of the function. 



Network error 



The network error for one pattern is defined in terms of the van Rossum distance be- 



tween each output spike train and each target spike train ( |van Rossum[|2001[ ). The error 
between the target spike train and the actual spike train is defined as the Euclidean dis- 
tance of the two filtered spike trains (van Rossum, 2001[ ). The filtered spike train is 
determined by an exponential function associated with the spike train: 



j(t) = J2 exp[-(t - ti)/r,]H{t - U 



(40) 



where tj are the times of the spikes, and H{t) is the Heaviside function, is the time 
constant of the exponential function. Tc is chosen to be appropriate to the inter spike 
interval of the output neurons ( van Rossum[ 2001[ ). In the following simulations the 
output neurons are required to fire approximately one spike in 10 ms, thus Tc = 10 ms. 
The distance between two spike trains is the squared Euclidean distance between these 
two functions: 



1 



r, 



[fit) - git)rdt 



(41) 



c Jo 



where the distance is calculated over a time domain [0, T] that covers all the spikes in 
the system. The van Rossum distance is also used to determine the output pattern during 
learning and testing. The output pattern is determined as the closest to one of the target 
patterns in terms of the van Rossum distance. 
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